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ABSTRACT 

High- volume feature-rich data sets are becoming the bread-and-butter of 21''* century 
astronomy but present significant challenges to scientific discovery. In particular, iden- 
tifying scientifically significant relationships between sets of parameters is non-trivial. 
Similar problems in biological and geosciences have led to the development of systems 
which can explore large parameter spaces and identify potentially interesting sets of 
associations. In this paper, we describe the application of automated discovery systems 
of relationships to astronomical data sets, focussing on an evolutionary programming 
technique and an information-theory technique. We demonstrate their use with classi- 
cal astronomical relationships - the Hertzsprung- Russell diagram and the fundamental 
plane of elliptical galaxies. We also show how they work with the issue of binary classi- 
fication which is relevant to the next generation of large synoptic sky surveys, such as 
LSST. We find that comparable results to more familiar techniques, such as decision 
trees, are achievable. Finally, we consider the reality of the relationships discovered 
and how this can be used for feature selection and extraction. 
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1 INTRODUCTION 

The rate of scientific discovery in astronomy has tradition- 
ally been tied to the amount of data available. The advent 
of digital astronomy with modern detectors and computa- 
tional resources, e.g., databases, has changed this. Although 
more data in the past two decades has allowed us to dis- 
cover the cosmic web, dark energy, and exoplanets, the vast 
majority of such low-hanging fruit have now been found. 
The new challenge is growing data complexity. The era of 
data-intensive astronomy promises a vastly more thorough 
exploration of parameter space but the discovery of new 
scientifically significant relationships is equally more com- 
plicated and daunting when faced with overwhelming data 
dimensions and volumes. 

Given a highly complex data set, such as SDSS, with 
hundreds of parameters for each object, and suflicient num- 
bers of objects - hundreds of millions or more - to provide a 
fair and representative covering of parameter space, one will 
uncover many significant relationships - linear, nonlinear, 
functional, structural - between pairs, triplets and groups of 
parameters. However, only a small fraction of these will be 
truly causative, the result of some valid underlying astro- 
physical process or processes, and identifying these is non- 
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trivial. In fact. lCubitt. Eiser fc Wolj (|2012l) have shown that 
identifying the underlying dynamical equations from any 
amount of experimental data, however precise, is a provably 
computationally hard (NP-har43) problem. 

The framework of astroinformatics, combining astron- 
omy, applied computer science and information technology, 
has arisen to contend with this computational intractability. 
At its core are sophisticated data mining and multivariate 
statistical techniques which seek to extract and refine infor- 
matio n from highly complex data sets (see lBall fc Brunnej 
(|20ld ) fo r an overall review of di fferent techniques in as- 
tronomy, iBloom fc Richards! (|201ll ) for those specific to the 



time domain and the IVOA KDDIG web page^ for general 
material related to this). This includes identifying unique 
or unusual classes of objects, estimating correlations, and 
computing the statistical significance of a fit to a model in 



^ In computational complexity theory, a problem that is solvable 
in polynomial time by a nondeterministic Turing machine is an 
NP (nondeterministic polynomial time) problem. NP-hard prob- 
lems are those which are at least as hard as any NP- problem, e.g., 
given a set of integers, does any non-empty subset of them add 
up to zero? 

^ International Virtual Observatory Alliance Knowl- 
edge Discovery in Databases Interest Group: 
[iittp: //www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/ivoaKDD| 
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the presence of missing or bounded data, i.e., with lower 
or upper limits, as well as visualizing this information in a 
useful and meaningful manner. However, the nature of these 
methodologies is, at best, semi-automated with focused ap- 
plication in particular regions of discovery space rather than 
allowing an unbounded exploration of what might be there. 

In recent years a number of approaches have been pre- 
sented in t he general sci e ntific lit erature that seek to redres s 



this, e.g.. [Oliver et al.l (|2004). ISchmidt fc LipsonI |200i), 



ISparkes et al.l (|2010t ). iReshef et al.l (120111 ). Discoverv-based 



science employs cutting edge data mining techniques for au- 
tomated hypothesis forming and automated theorem prov- 
ing. Many of these tend to have originated within the con- 
text of systems biology (out of association analysis) where 
researchers are attempting to identify and derive universal 
relationships in biological systems akin to those which seem 
to exist in physical ones, although there is prior art in com- 
puter scien ce, particula rly within the area of genetic pro- 
gramming (|Kozal (|l992l )'). 

These methods are also similar in scope to various 
feature selection and extraction and dimensional reduc- 
tion techniqu e s, such as principal component anal ysis (e.g., 
iFrancis et ahl (|l992l )') and self-organizing maps jKohonenl 
(198^)), which attempt to counter the "curse of dimension- 
ality" by reducing high dimensional data to lower more man- 
ageable dimensions whilst preserving meaningful structures 
within them. Nonetheless these so-called automated discov- 
ery methods are applicable to any general data set and es- 
pecially to those with many variables, such as arise in eco- 
nomics, climate science, sensor networks or any field advo- 
cating an informatics-based approach. 

In this work, we describe the application of automated 
discovery systems of relationships to astronomical data. We 
have focused in particular on two types of approach - those 
that seek to identify general connections (correlations) be- 
tween particular parameters in a data set and those that 
try to formulate a specific functional relationship between 
parameters. These may be considered representative of the 
type of mapping of discovery space that has so far been at- 
tempted. A common complaint of data mining techniques is 
that they usually follow a "black box" approach - the data 
goes in and the answer comes out but there is no real un- 
derstanding of how one led to the other. We hope to show 
that automated discovery systems are also more translucent 
if not actually transparent and allow some deconstruction 
of the methodology to understand what is going on inside. 
This is particularly important if their discoveries are to be 
scientifically validated, i.e., a particular relationship is not 
only statistically significant but also stems from a (new) 
non-trivial underlying cause. 

It should be noted that although these discovery tools 
are labelled as automated, they are actually employed as 
part of a collaborative human-machine discovery process. 
In data-intensive problems, not only is data processing and 
analysis automated but also necessarily, given the data vol- 
umes and dimensions, the first levels of data interpreta- 
tion. The human expert now validates machine-generated 
hypotheses rather than attempting to formulate them them- 
selves. We still make discoveries, but as the complexity of 
data increases, we need machine intelligence to help us guide 
towards an insight. 

This paper is structured as follows: in section 2, we will 



describe the two specific techniques we are applying whilst 
in section 3, we will present a number of different astronom- 
ical contexts in which these have been applied - these both 
attempt to mimic or recreate past discoveries as well as find 
new ones. We will analyze and discuss our results in sections 
4 and 5 and present our conclusions in section 6. 



2 AUTOMATED DISCOVERY SYSTEMS 

The methods we are applying in this paper will probably 
be unfamiliar to many astronomers and so, in this section, 
we will introduce some of the pertinent terminology and 
formalism related to them. 



2.1 Maximal information coefficient 



The maximal information coefficient (MIC; iReshef et al.l 
(j201lh l aims to be the 21 st-century eq uivalent of the Pear- 
son correlation coefficient (|Speedl(|201lh l but it goes beyond 
just expressing linear associations and can quantify general 
associations between variables. It is based on the mutual 
information between two random variables, A and B: 



MI(AB) = ^^p(a,6)log 



p{a,b) \ 
p{a)p{b) ) 



(1) 



where p(a) and p{b) are the marginal probability mass func- 
tions of A and B and p(a, b) is the joint probability mass 
function respectively. 

Now consider a partitioning of a data set, D, of or- 
dered pairs, {{ai,bi),i — 1, . . . ,n}, into an x-hy-y grid, G, 
such that there are x bins (of variable size) covering a and 
y bins (also of variable size) spanning b respectively. The 
probability mass function of a particular grid cell is clearly 
proportional to the number of data points falling inside that 
cell and so, for a given (x, y), there will be a maximal mutual 
information. We can then construct a characteristic matrix 
M{D) whose elements: 



<MI) 



log rmsi{x,y} 



are the highest normalized mutual informations achieved by 
any of the corresponding x-hj-y grids. The maximal infor- 
mation coefficient is then defined to be the maximum value 
in M, such that xy < C where C is a function of the sample 
size and represents the maximal grid size considered. Too 
high a value for C can lead to non-zero scores even for ran- 
dom data because each data point gets its own cell, while 
setting it too low means th at only simple patterns are con- 
sidered. iResheFeT^al] (Hoill) found empirically that C = n"'® 
provides a satisfactory limit: 

MIC(Z)) = max {M{D)^,y} 

xy<C{n) 

The behaviour of MIC is that it tends to 1 for all never- 
constant noiseless functional relationships and to for sta- 



tistically independent variables. Moreover, MIC 



where 



r is the Pearson correlation coefficient, is an indicator of a 
nonlinear relationship between two variables: as r is a mea- 
sure of linear dependence, the statistic MIC - r^, is near to 
for linear relationships and large for nonlinear relation- 
ships with high values of MIC. Other measures involving 
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2.2 Symbolic regression 
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Figure 1. This shows three separate 3x2 grid partitions of a data 
set of 100 points randomly selected from a cubic relationship. The 
mutual information for each grid is; solid line: 0.059, dashed lino; 
0.044 and dotted line; 0.023 respectively. 



MIC and M (the characteristic matrix) can also indicate de- 
viations from monotonicity, the degree to which the data set 
appears to be sampled from a continuous function and the 
complexity of the association, as different relationship types 
give rise to characteristic matrices with different properties. 

The statistical significance of a MIC value can be deter- 
mined from comparison of a real value against a set of values 
from 1/a — 1 surrogate data sets where a is the probability 
of false rejection. Because MIC is a rank-order statistic, the 
uncorrected p-value (essentially the one-tailed p-value for 
this statistic; when multiple hypotheses (many parameters) 
are being tested, a corrected value must be used to mitigate 
false positives) of a given MIC score under a null hypothesis 
of statistical independence depends only on the score and on 
the sample size of the relationship in question and not on the 
specific relationship being tested. Precomputed uncorrected 
p-values are available for different sample size^. 

To illustrate this statistic, consider a data set of 100 
points randomly selected from a cubic relationship {y = 
2x^ — — 3x + 2, X £ [—1.5, 2.5]) plus a unitary Gaussian 
noise term, i.e., a Gaussian about the y-value with a = 1. 
We can partition this data set into a set of 3 x 2 grids (the 
maximum grid size is set to xy < 15.8 for this data) of vari- 
able size (see Fig. [T]). Each grid has a mutual information 
associated with it and for a given partition configuration, 
e.g., 3x2, there will be a maximal mutual information. The 
maximal (normalized) mutual information across all config- 
urations (44 in this case) is the maximal information co- 
efficient. This data set has a statistically significant MIC 
of 0.836 compared to a linear regression coefficient of just 
0.068. It also has high values for the measures indicating 
nonlinearity (0.831) and functionality (0.836) and moderate 
non- monotonicity (0.427). 



Symbolic regression is the task of finding a function, in sym- 
bolic form, that fits a finite sample of data. The most ef- 
ficient appro ach employs a genetic algorithm-based search 
(Koz^ (| 19921 )) of the space of mathematical expressions to 
determine the best-fitting functional form. Successive gener- 
ations of formulae are specified in terms of a (user-defined) 
mathematical alphabet of atomic building blocks, such as 
algebraic and boolean operators, analytical function types 
- trigonometric, exponential/logarithmic, power laws, etc., 
and state variables, which keeps the search tenable. Its ad- 
vantage over more standard regression methods is that the 
search process works simultaneously on both the model spec- 
ification problem (the form of the fitting equation) and the 

problem of fitting coefficients. 

EureqcQ (now also called Formulize) jSchmidt fc LipsonI 

(|2009l )) is a software tool which employs symbolic regression 
to describe a data set by identifying the simplest mathemat- 
ical formulae which could describe the underlying mecha- 
nisms that produced the data. The tool works from the nu- 
merical partial derivatives of each pair of variables in the 
input data set and uses an evolutionary algorithm to explore 
this partial differential metric space for non-trivial invariant 
quantities, looking for predicted partial derivatives that best 
match the numerical ones: 
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where f{xi, yi) is one of the candidate functions. The search 
continues until some stopping criterion - time elapsed, 
goodness-of-fit, confidence of fit (maturity and stability), 
etc. - is met. The output is then an ordered list of final 
candidate analytical expressions on the accuracy-parsimony 
Pareto front, i.e., the tradeoff between the most optimal 
(best-fitting according to some criteria) and complexity. 
Each mathematical operation in an expression has a numer- 
ical value (cost) associated with it, e.g., addition = 1, expo- 
nentiation = 4, and the complexity of a formula is defined 
here as the sum of these values. A high-order polynomial 
could therefore be more complex than a straightforward ex- 
ponential or trigonometric function. 

When comparing and optimizing solutions, Eureqa em- 
ploys a user-defined error metric. A number of different mea- 
sures are available and the nature of the data can help deter- 
mine which is the most appropriate, for example, minimizing 
the mean of the squared residual errors is suitable for nor- 
mally distributed noise whereas the logarithmic error is bet- 
ter for many outliers. Data can also be weighted according 
to some prescription, although the importance of particular 
variables can always be explicitly stressed in the definition 
of the equation form being searched for. There are, too, vari- 
ous types of data preprocessing operations available, familiar 
to data mining, such as normalization, outlier rejection and 
missing value handling. 

The results of symbolic regression, i.e., the expressions 
identified by Eureqa, are the best (non-trivial) mathematical 
descriptions of the data. Their interpretation and physical 
validity, however, remain an exercise for the human expert, 



^ http:/ /www. exploredata.net/Downloads/P- Value- Tables 



* http;/ /creativemachines. cornell.edu/eureqa 
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who may take them at face value or decide to cross check 
them ("prove them") using other techniques. 



3 EXPERIMENTS 

In tliis section, we report on a number of automated discov- 
ery experiments we have carried out with different represen- 
tative astronomical data sets. A number of different options 
are available, depending exactly on how you want to mea- 
sure the whole process. It should be noted that in applying 
our techniques, we are not simply fitting a set of formulae 
to data but that the respective discovery methods decide 
which variables to use and in what functional relationship 
and then find the optimal coefficients and measures of fit. 
The two methods are also sufficiently different that it is in- 
teresting to compare their findings relative to each other. 



3.1 The Hertzsprung- Russell diagram 

The Hertzsprung-Russell (HR) diagram is the quintessen- 
tial representation of physical relationships associated with 
different stages of stellar evolution. The original plot of 
magnitude vs. temperature can be considered as the two- 
dimensional PDF, P(M, Te//); more modern versions also 
incorporate metallicty and surface gravity giving a four- 
dimensional PDF, P{M,Teff,[M/H],\ogg) - which con- 
strains all of its arguments. The parameterization of these 
relationships in terms of observable and non-observable stel- 
lar quantities expressed as a function of th e observable color 
is an o pen probl e m in astronomy, e.g., IWilson fc HurlevI 
l|2003h , [Zaiiinettil (|2008l '). This is particularly relevant for 
the next generation of large photometric surveys, e.g., LSST, 
where spectrosc opy of every source is not feasible. Note that 
iLiu et all (|2012h describe a related problem of inferring the 
astrophysical parameters of stars from Gaia spectrophotom- 
etry. 

Unfortunately, prior to the availability of the Gaia 
data, there is no single large stellar data set which of- 
fers both accurate distances and physical parameters for 
a representative sampling of the HR space. Hipparcos 
has reliable distances but no i ntrinsic parameters , such 
as Te// or \M/H]. RAVE D R3 (|Siebert et all (|201ll )) and 
SEGUE (|Yannv et al.l (|2009l ')') both offer spectroscopically- 
determined parameters {T^f j , g,[M / H\) but lack distance 
information - RAVE DR3 shares only 685 objects with Hip- 
parcos and SEGUE none. A photometric parallax relation- 
ship has been d efined for SEGUE based on stellar metal- 
licity and color (|lvezic et al.l (|2008l )') but the corresponding 
HR diagram shows only a main sequence (see Fig. [Sjl . For 
a relatively complete coverage of the parameter space, we 
have therefore constructed a data set consisting of all stars 
in SIMBAD with a quoted parallax, effective temperature 
(Te//), surface gravity (g), and metallicity [M/H]). Fig. [3] 
shows the HR diagram for this data set of 3865 stars. 

As a comparison, we have also considered the distribu- 
tion of parameters for stars from a single globular cluster. 47 
Tuc is one of the most studied globular clusters: it is compar- 
atively near, one of the more massive (a nd hence populous ) 
clusters, and it is relatively metal rich. iLane et ai.l mm 
give stellar parameters for 19 92 stars in 47 Tuc but only 




Figure 2. This shows the Hertzsprung- Russell diagram for the 
SEGUE data using photometric parallax to determine absolute 
magnitude. 
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Figure 3. This shows the Hcrtzsprung-RusscU diagram for the 
3865 stars in SIMBAD with parallax, effective temperature, sur- 
face gravity and metallicity values. The color coding is according 
to spectral type , broadly defined as: blue - O, B, A; green - F, 
G; yellow - K; red - M. 



have measured B, V, and I for ~ 200000 stars in 47 Tuc, giv- 
ing us a final data set of 1739 stars with stellar parameters: 
Tsff,g,[m/H] (uncalibrated metallicity) , [a/Fe],^ (micro- 
turbulence) and Vrot (rotational velocity) and B and V mag- 
nitudes (see Fig. |4] for its HR diagram). 

We ran Eureqa on both data sets to see if we could re- 
cover a suitable formulation of the HR relationships, specif- 
ically instructing the code to look for formulae of the form: 

Mv=f{B-V,g,nff,[M/H]) 

in the first case (general HR) and: 

Mv = f{B - V,g,nff, [m/H], [a/Fe],^,Vrot) 

in the second (47 Tuc). Although these formulations are 
based on prior knowledge of what the dependent variables 
are and also what data is available, symbolic regression in- 
corporates feature selection and so will only use a subset of 
the most relevant variables, in this case those which persist 
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Figure 4. This shows the Hortzsprung-RussoU diagram for the 
stars in 47 Tuc taken from Bergbusch & Stetson (2009). The 
red points indicate those objects for which spectroscopically- 
determined parameters are available (Lane el al. (2011)), which 
are limited to those which have turned off the main sequence. 



in successive generations of calculations in the evolutionary 
algorithm, rather than all available variables (see section 3.3 
for an explicit demonstration of this). We also consider the 
choice of variables more in section 4. 

We use a set of mathematical building blocks restricted 
to: constants, basic operators (+, -, *, /), exp(), log(), x". 
We employed an R'^ goodness-of-fit error metric - Eureqa 
attempts to maximize this quantity in its fits - and selected 
an 80:20 split of the data in terms of test set and valida- 
tion set. Data with any missing values were ignored (other 
options are available) and no weighting was used for any pa- 
rameter in terms of its error as the heterogeneity of the data 
means that not every value has an error associated with it. 
27 CPU-hour runs (taking 1.5 hours on 18 cores) produced 
a number of formulae of varying complexity and correlation 
coefficients of around 0.85 red for both data sets (see Ta- 
ble [l|. We also ran it for the SIMBAD data set restricting 
the formulae to just power law expressions (no exp() or log() 
operators). 

We shall consider the results obtained for the SIMBAD 
data set first. To validate the results and test against overfit- 
ting, i.e., the formulae are actually describing random errors 
or noise in the data instead of any underlying relationship, 
we determined the median absolute error for each formula 
when applied to the RAVE DR3 and SEGUE data sets men- 
tioned above. Johnson B and V magnitudes were derived 
fr om the SEGUE data using the transformation equations 
of iLuptonI (|2005l ) and an absolute V magnitude determined 
using the inferred parallax relating apparent r magnitude 
and absolute r magnit ude calculat e d usin g the photomet- 
ric parallax method of llvezic et al.l ()2008h . Any systematic 
errors that these transformations may introduce can be es- 
timated from a plot of My vs. Teff for the SEGUE data 
compared to the SIMBAD data. Since Te// is calculated 
spectroscopically, any photometric offset will show in the 
relative positions of the main sequences of the two data sets. 



The left plot in Fig. [5] shows good agreement between 
the SIMBAD and RAVE DR3 data sets with an offset of the 
SEGUE data relative to the other two. This offset can be 
estimated from the difference between linear fits to the main 
sequences of both SIMBAD and SEGUE data sets defined 
between the regions of Te// — 5000 and Te// ~ 6500 and 
we find a value of AMy = -1.112 for SEGUE. A similar 
procedure can be performed with plots of {B — V) vs. T^ff 
to estimate any systematic errors in the color and we find 
a value of A{B - V) = -0.041 for SEGUE. The right plot 
in Fig. [5] shows the agreement between the three data sets 
when the offsets have been applied. 

Table [1] gives the median absolute difference (MAD) 
between the "measured" absolute magnitude and the esti- 
mated value for each formula when applied to the SIMBAD, 
SEGUE and RAVE data sets. For comparison, we also com- 
puted the MAD between the observed data an d the val- 
ues de rived from the semi-analytical formulae of IZaninettil 
(|2008h relating My and {B — V) for each data set (note that 
Zaninetti's other formulae relating mass, radius and lumi- 
nosity to {B — V) all derive from these), although those are 
only defined over the range —0.33 < {B — V) < 1.80 and are 
stellar luminosity class dependent, with separate relation- 
ships for main sequence, giant, supergiant and white dwarf 
stars. 

It is worth bearing in mind when looking at these results 
that the various functional relationships that this approach 
finds are, in some statistical sense, the optimal descriptors 
of the data - they are phenomenological. Their physical in- 
terpretation, however, remains the purview of the human 
scientist. This method aims to identify all potentially in- 
teresting, significant relationships without any preconceived 
bias, e.g., due to some established notion of what should 
actually be there. 

The results for the SIMBAD and SEGUE data sets are 
broadly consistent suggesting that the found formulae pro- 
vide a good description of the variable relationships in the 
data but do not overfit it. It should be not surprising that 
the Zaninetti formulae give better results for the SEGUE 
data set since it essentially just consists of a main sequence 
and uses that class specific result. The Eureqa results are 
derived for a range of luminosity types and so give a bet- 
ter broader fit but not necessarily for specific luminosity 
classes. The poor performance on the RAVE DR3 data set 
can be largely attributed to the errors on the parallax value 
(the mean value is 7.63 mas with a mean error of 1.91 mas) 
and thus the absolute magnitude (54% of the objects have 
o'Mir > 1). If we restrict the analysis to those stars with 
fA/v- < 1, we find that the MAD values drop to ~ 1 for the 
Eureqa formulae and 0.6 for the Zaninetti formula. 

We can also constrain the Eureqa algorithm to use those 
formulae which contain particular variables or terms: for ex- 
ample, a number of the solutions in both sets of formulae 
contain a term. A set of formulae derived with these lim- 
itations have similar MAD values as for the more generic 
power law. 

The relationships found for the 47 Tuc data set are more 
specific since they only cover post-main sequence (PMS) 
stars. Table[2]shows that they fare much better than the Za- 
ninetti formula on this data. We also note that the formula 
include dependencies on parameters related to convection 
phenomena in stellar atmospheres as would be expected for 
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Figure 5. This shows the relative positions of the main sequences of SEGUE (red), SIMBAD (blue) and RAVE DR3 (black) data sets. 
In the left plot, the SIMBAD and RAVE data sets agree well but there is an offset of the SEGUE main sequence relative to the other two 
introduced by errors in the photometric transformation and parallax methods applied to it. Linear fits to the main sequences estimates 
the offset as AMy = —1.112. The right plot shows the agreement between the data sets when this offset and a similarly estimated color 
offset of A(B -V) = -0.041 is appUed to the SEGUE data. 



Table 1. The best- fitting formulae found by Eureqa to describe the SIMBAD HR diagram. The goodness-of-fit value (R^) is determined 
from the 20% of the data set defined to be the validation set. The error value for each data is the median absolute error when applying 
the fit. The first three formulae are free to include exponential, logarithmic and power law components whilst the second are restricted 
to just power law expressions. 



Function {My = /(...)) 




Complexity 


/?2 


SIMBAD 


RAVE DR3 


SEGUE 


0.374C/2 _ 4.34 x 10-'*Te// 
{B-V){l-[M/H]) + I^^- 
gi(B-V) + 2.33)+ 


7.62 

log(Tejj)-0.645[M///] 


9 
16 
30 


0.823 
0.867 
0.889 


0.569 
0.432 
0.409 


2.104 
1.584 
1.765 


0.932 
0.669 
0.644 


^"^s' 2.48 




10 


0.822 


0.435 


1.567 


0.891 


366+t!„ - 1-94 - 9-56 X lO-^T,^^ 

0. 641+0. 791[J\//fl"l 1 14000+147209 


- 0.348(1////] 

- 5.17 


20 
29 


0.842 
0.868 


0.408 
0.401 


1.488 
1.528 


0.799 
0.776 


1.18-(B-V)-g ' f -917g 


Zaninetti 








0.597 


0.822 


0.504 



PMS stars. The metallicities used in the fitting formulae, 
[m/H], ar e the uncalibrated ones determined by the RAVE 
pipeline in lLane et al.l (|201ll ) - the uncertainties are 0.1 dex. 
To compare the fits on PMS stars from the SIMBAD data 
set, we need to repl ace \m/H] with an eq uivalent expression 
in terms of [M/H]. IZwitter et al.l (|2008l ) give a calibration 
equation for RAVE-derived metallicities: 

[M/H] = 0.93S[m/H] + 0.767[a/Fe] - 0.064 log 5 + 0.404 

but note that [a/Fe] has a typical recovery error of up to 
0.15 dex and only spans 0.4 dex in range. Thus, although the 
SIMBAD data have no measured [a/Fe], we can assume a 
mean value of 0.2 for use in determining uncalibrated metal- 
licities with reasonable accuracy. We note that the SIMBAD 
PMS data also shows greater intrinsic scatter than the 47 
Tuc data. 

As Table[3]shows, for the SIMBAD data, MIC identifies 
statistically significant relationships between My and {B — 
V) , Tef f and g respectively but not [M / H] . Those involving 



Teff and {B — V) are also more likely to be nonlinear in 
nature than that with g. In fact, there seems to be a general 
set of relationships between My, {B — V), T^ff and g but 
not with [M/H]. Certainly, this is in line with the Eureqa 
formulae where the [M/H] dependence is not complex but 
strictly linear. The MIC results for the 47 Tuc data show 
the significant relationships found in the SIMBAD data set 
but also ones involving microturbulence and metallicities as 
we would expect for PMS stars. Note that there is a weak 
dependence between Vrot and My but not between it and 
any other parameter. The relationships between My and 
{By),g, Tef f and ^ are also again more likely to be nonlinear 
in nature. 

3.2 The fundamental plane of elliptical galaxies 

The global properties of elliptical galaxies, such as lu- 
minosity, projected velocity dispersion, e tc., fo rm a two- 
dimensional family (|Diorgovski fc David |l983), hereafter 
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Table 2. The best-fitting formulae found by Eureqa to describe the 47 Tuc HR diagram and the median absolute error obtained when 
applying the fit to the data. 



Function {Mv = /(■■■)) 


Complexity i?^ 


47 Tuc 


SIMBAD PMS 


0.685+ 


8 


0.283 




g-lm./H]-2.39[a/Fe] ^ 
B-V 5 


12 


0.260 




13.2 + (0.002Te// - 1.62g - 8.42(B - V))[m/H] - 


- 12.8(B - V) 23 


0.239 


0.990 


Zaninetti 




0.773 


0.745 



Table 3. The MIC measures between the variables used to define the Hertzspung- Russell diagram for the SIMBAD and 47 Tuc data 
sets. For these data sets, a value of MIC > 0.41 for SIMBAD and MIC > 0.17 for 47 Tuc is significant at the 10"** level respectively. 
This also illustrates the nx(n — l)/2 nature of the output for a data set of n variables. 



Variable pair 


MIC 


Nonlinearity 


Non-monotonicity 


Functionality 


Complexity 


Linear regression 


SIMBAD 














(B-V)vs. T,^^ 


0.82 


0.40 


0.02 


0.82 


7.11 


-0.65 


My vs. g 


0.63 


0.04 


0.05 


0.63 


7.11 


0.77 


Mv vs. T^ff 


0.54 


0.48 


0.08 


0.54 


7.11 


-0.24 


g vs. T^ff 


0.52 


0.46 


0.03 


0.52 


7.11 


0.24 


Mv vs. {B - V) 


0.49 


0.47 


0.09 


0.48 


7.11 


-0.12 


[B - V) vs. g 


0.46 


0.14 


0.0 


0.46 


7.11 


-0.57 


T,ff vs. [M/H] 


0.11 


0.09 


0.02 


0.11 


7.11 


0.14 


Mv vs. [M/H] 


0.09 


0.08 


0.03 


0.09 


7.11 


-0.09 


(B - V) vs. [M/H] 


0.08 


0.08 


0.01 


0.08 


7.11 


0.01 


g vs. [M/H] 


0.07 


0.06 


0.02 


0.07 


7.11 


0.10 


47 Tuc 














Mv vs. {B - V) 


0.75 


0.54 


0.22 


0.75 


6.43 


-0.45 


Mv vs. g 


0.62 


0.56 


0.03 


0.62 


6.43 


0.23 


g vs. Te// 


0.56 


-0.01 


0.03 


0.56 


5.75 


0.76 


{B-V) vs. T,jj 


0.56 


0.08 


0.04 


0.56 


6.29 


-0.69 


{B - V) vs. g 


0.54 


-0.03 


0.06 


0.54 


6.13 


-0.75 


Mv vs. T^if 


0.50 


0.49 


0.09 


0.50 


6.43 


0.14 


Mv vs. 5 


0.38 


0.36 


0.08 


0.38 


6.25 


-0.12 


[a/Fe] vs. ^ 


0.34 


0.32 


0.11 


0.34 


6.43 


-0.13 


TefS vs. C 


0.29 


0.28 


0.13 


0.28 


6.43 


0.07 


[m/H] vs. [a/Fe] 


0.23 


-0.09 


0.01 


0.23 


6.36 


-0.57 


g vs. i 


0.23 


0.12 


0.03 


0.23 


6.43 


-0.33 


{B - V) vs. e, 


0.22 


0.14 


0.06 


0.22 


6.43 


0.28 


Mv vs. [m/H] 


0.21 


0.20 


0.09 


0.21 


6.43 


0.11 


Mv vs. Vrot 


0.19 


0.19 


0.14 


0.19 


6.43 


-0.02 


g vs. [m/H] 


0.17 


0.03 


0.02 


0.17 


6.43 


0.38 



DD87). In particular, an empirical relationship was found by 
multibivariate statistics between the mean surface bright- 
ness, central velocity dispersion, and effective radius of 
an elliptical galaxy - the so-called fundamental plane 
- which can b e em ployed as a distance indicator, e.g., 
iDressler et all (|l987l ). This has its physical basis in the 
virial theorem, although there are further structural de- 
pendencies exhibited between d warf and giant ellipticals 
l|Guzman. Lucev fc Boweil (|l993t )). 

Using Eureqa, we searched the original data set (161 
objects) used by DD87 for relationships of the form: 

log(re) = /(logo-, </.t>,A/g) 

where re is the semimajor axis, a is the velocity dispersion, 
< /I > is the mean surface brightness and Mg is the absolute 
magnitude in the ra band. We used a slightly modified set 
of building blocks from that which we used in the previous 
section, in that we also allowed for periodic behaviour which 



could be described in terms of a sine function. This increases 
the size of search space available, allowing for a wider set of 
possible relationships, but also, potentially, the computation 
time. We note, however, that have no expectation of periodic 
behaviour; in fact, we know that it would make no sense in 
this particular context. Rather part of the experiment is just 
to see what effect allowing for it in the building blocks might 
have. We also employed a fitness metric based on the mean 
absolute error. 

The best fit (lowest complexity, highest accuracy) for- 
mula was: 

log(r-e) = log cr-f 0.271 <fi> -4.09 

which should be compared with the original relationship re- 
ported by DD87 (see also Fig. O: 

log(re) = 1.39(log o- + 0.26 < ^ >) - 6.71 

The two fits have equivalent accuracies - both give rms er- 
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Figure 6. This shows the relative distributions of the original 
fundamental plane relationship discovered by Djorgovski & Davis 
(1987) (diamonds) and that found by Eureqa (triangles). 



rors of 0.157 and correlation coefficients of 0.91. Higher or- 
der formulae give even slightly better fits, e.g., log(re) = 
(0.14 logo- - 1) <^i> ~Mg with 0.142 and 0.92 respectively, 
but there is a danger that this is overfitting the data, par- 
ticularly given the small size of the data set. We note also 
that the sine function was not used. 

Given that both relationships are sufficiently similar in 
form (complexity) and accuracy, the question arises as to 
which one is correct? This judgement call is beyond the 
scope of current discovery systems and is where the (human) 
expert must step in and provide the necessary interpretative 
knowledge. In this case, the relationships are encoding physi- 
cal correlations between the size of a galaxy and its effective 
surface brightness and the luminosity and central velocity 
dispersion. Even though the form of the expressions is sim- 
ilar, they actually translate into quite different predictions. 
The DD87 formula gives: 



</i> ' 



J 7-1 1-4 ^ ^ -0.07 

and On oc (T() <fi> 



where D„ is the diam eter within wh i ch th e mean surface 
brightness is 20.75/is (jPressler et all (|l987l D. whereas the 
Eureqa result gives: 

< /i > ~ L^'^ and D„ oc (tq < /i >'''^^ 

implying that more luminous galaxies have much lower sur- 
face brightnesses and that the distances to galaxies is less 
than that actually seen. 

The Eureqa fit makes no use of Mg and the value of MIC 
for this variable relative to log(re) is the lowest, consistent 
with a lack of dependency. There is also no indication of any 
type of bivariate relationship beyond a linear one, although 
Eureqa finds non-linear multivariate relationships to which 
the MIC is most likely not too sensitive. 



3.3 Binary classification of light curves 

Determining whether an object belongs to a specified class 
or not, e.g., a transient detection is a supernova or not, or, 



alternatively, whether it falls into one of two different (mu- 
tually exclusive) classes, such as star or galaxy, is an increas- 
ingly common activity in astronomy (note that multi-class 
classification problems can always be recast as a series of 
such binary decisions) . This is particularly true of survey as- 
tronomy where large data volumes and, most recently, real- 
time data streams require fast, accurate, and reliable classifi- 
cation systems. A variety of techniques h ave been employed 
in response. e.g.lDiorgovski et al.l (l2012bl '). including decision 
trees (e,g., Vasconcellos et al.l (|2011 )'). Bayesian networks 
(e.g., iDubath et all (l201ll)'). and support vector ma chines 
(SVM. e.g.. iBeaumont. WiUiams fc (Gloodmanl (120111 )). the 
latter representing the state of the art. 

Although it seems somewhat counterintuitive, auto- 
mated discovery systems can also be used as binary clas- 
sifiers. With Eureqa, the "trick" is formulate the search re- 
lationship as: 

class = g{f(xi,X2,X3, Xn)) 

where g is either the Heaviside step function or the logistic 
function, which gives a better search gradient and can be 
used to produce ROC0 curves for the resulting classification. 
Eureqa finds a best-fit function, /, to the data that will get 
mapped to a or a 1, depending on whether it is positively 
or negatively valued (or lies on either side of a specified 
threshold value, say 0.5, in the case of the logistic function). 
In other words, it finds an equation for the discriminating 
hyperplane which separates the two classes in some high- 
dimensional feature space. This is comparable to what a 
SV1V0 does but with an explicit computation of the mapping 
into feature space rather than just relying on inner products 
within it. An advantage of this approach is that the structure 
of the analytical fit function can also give insight into how 
the classification works, which is not normally true of other 
"black box" classifiers, such as neural networks. 

The Catalina Real -Ti me Transient Surve y 
(CRTS: brake et all (|2009l ): IPiorgovski et all (|2012al ): 
iMahabal et al.l '( 2011 )) is the largest open time domain 
survey currently operating, covering ~ 33000 deg^ between 
-75° < Dec < 75° (except for within ~ 10 - 15° of 
the Galactic plane) to ~ 20 mag. Light curves of several 
hundred million objects are availabl^H with an average 
of ~ 250 observations over a 7-year baseline. A common 
approach to light curve classification is to characterize the 
light curves through extracted features, such as moments, 
flux and shape ratios, variability indices, and periodicity 
measures. Vectors of such features derived from the light 
curves of known classes of objects are then used as the 
training sets for particular classifiers. 

We have considered three specific binary light curve 
classification problems using Eureqa: RR Lyrae vs. W UMa, 
CV vs. blazar, and Type la vs. core-collapse supernova. For 
each case, we compiled data sets of light curves of the ap- 
propriate classes of object and derived ~ 30-60-dimensional 
feature vectors for each object (see Appendix 1 for the full 



A ROC curve is a graphical plot which summarizes the per- 
formance of a classifier over a range of tradeoffs between true 
positive and false positive errors rates (see Fig. |9]) 
^ A Support Vector Machine (SVM) is the state-of-the-art binary 
classification algorithm. 
^ http://crts.caltech.edu 
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list of fe atures used) . These ar e a co mbination of the features 
used by [Richards et all (|20lj ) and iDebosscher et al] (|2007l ') 
and include statistical moments, flux ratios, Stetson J and 
K variability indices, a quasar-fitting measure and frequency 
analysis statistics. 

We ran a set of 10 4 CPU-hour Eureqa runs (1 hr on 
a quad-core machine) for each of three cases with each run 
omitting 10% of the data (giving training sets that are 90% 
of the data set) and the best-fit solution for that run (defined 
as the least complex which produces the largest number of 
true positive and negative class attributions) then applied 
with the omitted data as the validation set so giving us lOx- 
cross-validation on the resulting solutions. We report our 
results (see Table [Ijl in terms of the sum of all the results 
from the cross-validation runs. A logistic function was used 
in all cases to map the fitted function to the class variable. 
The physical interpretation of any relationships identified in 
these problems is beyond the scope of the present paper and 
will be addressed elsewhere. 



3.3.1 RR Lyme vs. W UMa 

Eclipsing binaries (W UMa) are the predominant contam- 
inant in studie s usin g RR Lyrae as tracers of Galactic 
structures, e.g., ISesaJ (|201ll ). and therefore being able to 
distinguish between them would be useful. We extracted 
CRTS hght curves for 482 RR Lyrae and 463 W UMa 
from SIM BAD and the AAVSO I i iterna tional Variable 
Star Ind ex (I Watson. Henden fc Pried (l2006l)) obtained from 
VizieR jOchsenbein. Bauer fc MarcoutI (|2000h ). The mag- 



nitude distribution for both classes of objects are shown 
in Fig. [T] Since both classes of object are periodic, we in- 
cluded periodic features in our characterization and used 
60-dimensional feature vectors. 

The overall best-fit formula was: 

/■ n^r^ 6.63 „ . 

/ = 278 X24. 24 

xio 

where 2:24 is the pr i ncipal period f r om th e Lomb-Scargle 
periodogram (|Lombl (| 19761 ) : IScargi3 (|l982t )) and xio is the 
median absolute deviation. The resulting values of class 
are 1 for RR Lyrae and for W UMa objects. The com- 
bined confusion matrix for the best-fit classifying formu- 
lae, i.e., summing the individual cross-validation results, is 



Figure 8. This shows the distribution of RR Lyrae (blue) and W 
UMa (red) stars in the period - median absolute difference plane 
identified by Eureqa. 



shown in Table |4^ and the ROC curve showing the de- 
pendencies between the true and false positive classifica- 
tion rates respectively as the logistic function threshold 
value is varied in Fig. [9^. It is interesting to note that 
this is essentially the period-amplitude relation which is 
used to differentiate between su bclasses of RR Lyrae (e.g., 
ISmith. Catelan fc Kuehnl (|201ll )). Fig. [8] shows how the two 
populations are clearly separated in this parameter plane. 

MIC measures were calculated for all pairs of features in 
this feature set. We would expect that significant relation- 
ships would be found for pairs of variables having a com- 
mon basis, e.g., those derived from the Lomb-Scargle pe- 
riodograms of the light curves or those which measure the 
fraction of outliers or degree of spread in the light curve, 
and this was confirmed. The MIC measure also largely cor- 
related with the regression coefficient for these pairs, i.e., 
those with a high MIC value had a high value as well 
and vice versa, but one strongly related pair (MIC close to 
1) had a very low linear regression (~ 0.12). The nonlinear- 
ity MIC statistic indicated was also large for this pair and 
the two features were found to be inversely proportional to 
each other. This clearly illustrates the power of MIC over 
traditional bivariate relationship analysis algorithms. 

MIC analysis of this feature set - calculating the MIC 
measures for all pairs of features - showed significant re- 
lationships between expected pairs of variables, e.g., those 
derived from the Lomb-Scargle periodograms of the light 
curves or those which measure the fraction of outliers or 
degree of spread in the light curve. These largely corre- 
lated with the regression coefficient for these pairs but 
one strongly related pair has a very low linear regression 
(~ 0.12). The nonlinearity MIC statistic indicated such a 
relationship and the two features were found to be inversely 
proportional to each other. This clearly illustrates the power 
of MIC over traditional bivariate relationship analysis algo- 
rithms. 

Looking for relationships between the class variable for 
the data set and the features showed a number of significant 
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[p < 10^^) pairings. All associations were also deemed to be 
complex and, with the exception of the median absolute de- 
viation, nonlinear. We'll discuss these in further detail later 
in the paper. 

3.3.2 CV vs. blazar 

The light curves of cataclysmic variables (CVs) and blazars 
can be difficult to differentiate as both exhibit aperi- 
odic/quasiperiodic variability with significant (several mag- 
nitudes) sudden outbursts. We extracted CRTS light curves 
for 404 known CVfjfl and 120 Fermi and MOJAVE blazarfl 
Periodic features were omitted in the characterization, giv- 
ing 25-dimensional feature vectors. The overall best-fit for- 
mula was: 

/ _o qyq \ 

/ = 140067x17 sin — — - 264152 

\xi — 1.481 / 

where xi is the amplitude an d xit is the signi f icance of the 
chi-squared quasar statistic ([Butler fc Bloonil ()201ll 'l'l. The 
combined confusion matrix for the best-fit classifying formu- 
lae is shown in TableUJa and the ROC curve in Fig.Os. From 
the matrix, the classifier is clearly more successful at iden- 
tifying CVs than blazars. This may reflect stronger class lo- 
calization for CVs in the feature space than for blazars, i.e., 
the distribution of CVs in the feature space is more com- 
pact and therefore a discriminating (bounding) hyperplane 
is more easily defined than for blazers. However, it is more 
likely due to the 10:3 population ratio of CVs and blazers in 
the data se t and a learning b i as - t he so-called test distribu- 
tion effect (|Weiss fc ProvostI (|2003l )) - that this has created 
in the classifier, i.e., with more exposure to CVs, the algo- 
rithm has preferentially evolved to classify them. We defer 
further discussion of this issue to section 5. 

MIC analysis of the feature set again shows a num- 
ber of expected significant relationships, i.e., flux ratios and 
quasar statistics, although the correlation with the respec- 
tive regression coefficients is much less than in the RR Lyrae 
vs W UMa case, which may be related to the lack of peri- 
odic features. The relationships also tend to be nonlinear 
but monotonic. In terms of associations with the class vari- 
able, only three significant features were found with no clear 
indication of nonlinearity or non-monotonicity. 

3.3.3 SN la vs. core-collapse SN 

Spectroscopic confirmation of supernovae candidates can be 
resource intensive and becomes intractable with the increas- 
ingly large numbers expected from the next generation of 
wide-field surveys, e.g., a few hundred thousand from Pan- 
STARRS and LSST. The Supernova Photome tric Classi- 
fication Challenge (SPCC; iKessler et all (|2010l )) aimed to 
improve the state-of-the-art of supernova classification al- 
gorithms based solely on photometric data, and, in partic- 
ular, separating out SNe Type la, which is important for 
cosmological studies. We tested our methodology on a set 
of 836 SNe la and 427 core-collapse SNe (lb, Ic, Iln, lip) 
light curves from the SPCC data set. Again, since we do not 

* http://nesssi.cacr.caltcch.edu/catalina/CVscrvicc/CVtabIc.html 
^ http://nesssi.cacr.caltcch.edu/catalina/Blazars/Blazar.html 



Table 4. The combined best-fit confusion matrices for the three 
binary classification cases using Eureqa and lOx-cross-validation. 
The results are the sums of each cross-validation run. 



(a) 


RR Lyrae 


W UMa 


RR Lyrae 
W UMa 


464 (96.3%) 
7 (2.5%) 


18 (3.7%) 
456 (98.5%) 


(b) 


CV 


Blazar 


CV 
Blazar 


368 (91.1%) 
45 (37.5%) 


36 (8.9%) 
75 (62.5%) 


(c) 


SNe la 


CC SNe 


SNe la 
CC SNe 


773 (92.5%) 
250 (58.6%) 


63 (7.5%) 
177 (41.4%) 



believe these to be periodic, we used only non-periodic fea- 
tures to characterize the light curves, giving 25-dimensional 
feature vectors. The overall best-fit formula was: 



where xio is the median absolute deviation, X13 is the per- 
centage difference between the extremum flux and the me- 
dian, X15 is the chi-squared quasar statistic, and xis is the 
significance of the chi-squared non-quasar statistic respec- 
tively. The combined confusion matrix for the best-fit clas- 
sifying formulae is shown in Table |3J; and the correspond- 
ing ROC curve in Fig. [9]:. The matrix again shows a strong 
classification bias for the more numerous class, although this 
time the population ratio is only ~2:1. 

The MIC results show expected relationships between 
flux ratios and measures of variability, all of which are 
mainly linear, monotonic, and in line with the respective 
regression coefficient results. More interestingly, though, is 
that there are no really significant associations between the 
class variable and the features - the most significant, the 
ratio between the (95th - 5th) flux percentile and the me- 
dian, is only significant at the ~3% level. This may indicate 
that the conventional set of features used to characterize 
light curves are inappropriate for those of supernovae, which 
would also explain the poor performance of the classifier - 
a clear discriminating hyperplane cannot be defined in this 
feature space. 

4 FEATURE SELECTION 
4.1 Posterior feature selection 

In examining data mining systems, it is often worth ask- 
ing whether a successful outcome is due to the power of 
the particular algorithm under consideration or due to a 
comprehensive training data set being used, with which any 
algorithm worth its salt would achieve good results. (Al- 
ternatively, a poor result with an otherwise excellent algo- 
rithm may be due to a limited training set). One way to 
answer this is to consider which features in the data set are 
employed by the algorithm and ask whether the features 
selected show any degree of sense - do they provide addi- 
tional insight into the data set - or should we regard them 
purely as phenomenological selections that just happen to 
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Figure 9. This siiows the ROC curves - true positive rate (TPR) vs. false positive rate (FPR) as the discrimination threshold is varied 
— for the three binary classification cases using Eureqa: (a) RR Lyrae vs. W UMa; (b) CV vs. blazar; and (c) SN la vs. core-collapse 
SNe. The best possible classifier would yield a point in the upper left corner (0,1) of the plot. The area under the curve (AUG) is an 
accepted performance metric for a ROC curve and related to the Mann- Whitney U statistic. The line y = x represents the scenario of 
randomly guessing the class. 



give good results? This is particularly so when only a sub- 
set of all the available features actually end up being used, 
i.e., there is some degree of feature selection present in the 
process, whether explicit or implicit (embedded), as hap- 
pens with evolutionary-based algorithms and the decision 
tree work we have compared Eureqa against. 

The MIC statistics already give some handle on the rel- 
ative importance of different bivariate relationships within 
the feature space and of particular features relative to the 
class variable in the classification examples. However, we 
would also like to be able to consider larger multivariate 
subsets of features, both for feature ranking according to 
some metric and to identify the optimal subset of features 
that characterizes the problem. We have considered two fur- 
ther specific feature selection techniques to compare against 
the results of Eureqa, MIC and the decision trees and deter- 
mine whether there is any consistency in the features used 
by the different techniques: consensus would imply that the 
shared features are relevant to understanding the problem 
under consideration. 



4-1.1 Sequential backward ranking 

Sequential backward ranking (SBR) is an unsupervised fea- 
ture selection method based on the entropy measure that 
aims to progressively reduce the dimension of a data set in 
an optimal fashion, i.e., at each stage, the reduced data set 
represents the best approximation to the original. It works 
thus: 

(i) Start with a full feature set T which characterizes a 
data set 

(ii) For each feature, f £ T, define a set of subsets, {J-f}, 
such that: Tj=T — j 

(iii) Select the feature which maximizes the quantity: 
S(J-') — S{J-f^) where S{J-) is the Shannon entropy (see 
below) of the feature set 

(iv) Update T such that: T = J- — jm 

(v) Repeat steps (ii) - (iv) until there is only one feature 
left 

The output is an ordered list of features in descending order 
of their entropy contribution or their significance. A super- 
vised version can also be constructed by replacing the con- 



straint in step (iii) with minimizing the classification error 
between that for T and Tf^ . 

In order to apply this technique, we must first define and 
evaluate the Shannon entropy of a feature set. Traditional 
estimators of the Shannon entropy, H{X) of a multivariate 
data set, X — {X\,X2, ■ ■ ■ , Xn}, require knowledge of the 
joint probability distribution of all the X„: 



H{X^ 



,Xn)\og[P{xi,...,Xn)] 



which is usually a fairly intr actable problem. However, 
iKozachenko fc Leonenkol (|l987l ) provide an alternative es- 
timator based on the distance to the k"'-nearest neighbour: 

n 

//(Xi, . . . , X„) = -i^{k) -f i^{n) -f log Cd + - y log(eO 

1=1 

where V is the digamma function {ip(x) = r'(a::)/r(a::)), is 
the volume of the d-dimensional unit ball (c^ — 7r'*''^/r(l -f 
d/2)) and is twice the distance from Xi to its k*''-nearest 
neighbour respectively. The error on the estimate is typically 
~ fc/iV or ~ k/N\og{N/k). 



4-1.2 Mmimum-redundancy-maximum-relevance 

In feature selection, it has been recognized that the combi- 
nations of individually good features do not necessarily lead 
to good overall performance, i .e., the m be st features are 
not the best m features (e.g., I Cover] (|l974l )). One way to 
tackle this is to consider simultaneously the relevance - the 
average mutual information between a set of features and 
a classification variable - and the redundancy - the aver- 
age mut ual information betw e en pa irs of features - of a fea- 
ture set. IPeng. Long fc Dind (|2005l ) proposed such a crite- 
rion (minimum-redundancy-maximum-relevance (mRMR)): 



max 

s 



1 

W\ 



J2 ^"^f^- 



/iGS 



1 



where the feature set S has individual features fi, c is the 
classification variable and MI the mutual information (see 
eqn. (1)) respectively. This approximates maximizing the 
mutual information between the joint distribution of the se- 
lected features and the classification variable but in terms of 
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bivariate quantities and not much harder to deal with multi- 
variate ones. Note that mRMR employs data discretization 
as a preprocessing step for continuous data since it is of- 
ten difficult to compute the integral form of eqn. (1) in a 
continuous space with limited numbers of samples. 



4-1.3 Feature comparison 

Table [5] gives the (ordered) lists of features selected by each 
method - MIC, Eureqa, decision tree, SBR, and mRMR 
- for the three data sets used in the binary classification 
problems. We did not consider the Hertzsprung-Russell di- 
agram or fundamental plane of elliptical galaxies data sets 
since these both involved too few variables to show any dif- 
ferences between the methods. The MIC results are those 
deemed to be statistically significant at the 10"'^ level rela- 
tive to the classification variable (see section 2.1 for details). 
Similarly, the mRMR results are also relative to the classifi- 
cation variable. Feature data for the mRMR algorithm was 
discretized into three states at the positions ± a (where 
is the mean and a the standard deviation respectively) such 
that it takes -1 if it is less than j-i — a, 1 if larger than ^ + a 
, and if otherwise. The SBR and mRMR results also just 
list the top five features in each case. Finally, the Eureqa 
and decision tree entries list the variables used without any 
implied ranking. 

The disparate nature of the rankings - ordered and un- 
ordered, different numbers of variables - makes any formal 
quantitative analysis, such as ranking aggregation, difficult. 
However, there a number of general comparisons that can be 
made. The features employed by Eureqa and decision trees 
are generally different - they only share one feature in each 
of the three problems - and decision trees are less parsimo- 
nious with more features. A similar lack of commonality is 
shown between MIC and both SBR and mRMR, although 
SBR and mRMR show a marginally stronger degree of over- 
lap, which should not be that surprising since they are both 
rely directly on entropy-related measures. Table [6] gives the 
relative fractions of the features selected by Eureqa and de- 
cision trees that are also identified by MIC, SBR and mRMR 
respectively. This suggests that there is more agreement be- 
tween Eureqa and the three explicit feature selection meth- 
ods than between decision trees and the same techniques, 
although none of them display any particularly strong asso- 
ciation. 

The differing nature of the classes of object in the three 
experiments leads one to expect that specific features or 
types of features would be selected in each and this does 
seem to be the case. The RR Lyrae/W UMa results include 
periodic measures for all methods except mRMR, reflect- 
ing the period-amplitude relationship, and the CV/Blazar 
results include either the QSO or non-QSO statistic for all 
methods. More interestingly, the QSO statistics are also se- 
lected by Eureqa, SBR and mRMR as discriminating fea- 
tures in the SNe data set (the other features selected are 
not common across the methods). These statistics measure 
the applicability of a damped random walk model to a light 
curve versus it exhibiting temporally uncorrelated variabil- 
ity. Although neither behavior is shown in either type of su- 
pernova light curve, both exhibiting a brightening and then 
decaying pattern with additional features in core-collapse 



supernovae, there must be some further information inher- 
ent in the light curve s to wh ich these statistics are sensitive. 

iGraczvk fc E^ (|201(]| ) propose that eclipsing binaries 
can be identifled in large photometric surveys based on the 
skew and kurtosis of their light curves. MIC finds some de- 
gree of relationship between the skew and kurtosis but not 
a nonlinear one and a stronger nonlinear dependence be- 
tween the class type and the skew than the kurtosis (the 
dependence of which is actually not statistically significant). 
mRMR identifies both skew and kurtosis, however, as sig- 
nificant features. Neither are flagged by Eureqa or decision 
trees, although in the former case, the "survivability" of the 
period - MAD solution dominates that of other possible rela- 
tionships. Restricting Eureqa to non-periodic features gives 
a set of formulae all dependent on the skew and variously 
the kurtosis and percentile ratios. A viable Eureqa-based 
feature selection strategy, particularly for feature-rich data 
sets, might therefore be to progressively restrict the set of 
features that are considered in any single iteration. 

It is also worth noting which features are not selected at 
all or only once by one method: flux ratios and the Stetson 
K variability index. These statistics can be broadly thought 
of as quantitative measures of the shape of the light curve 
and there is indeed little discrimination to be found in the 
shapes of the three binary categories of object alone, e.g., 
although RR Lyrae AB and W UMa are relatively easily 
distinguished from their phased light curves, RR Lyrae C 
are not. CV and blazar light curves are similarly not easily 
separable, particularly when sparsely and irregularly sam- 
pled such as the CRTS light curves. SN la and core-collapse 
SNe can be differentiated if enough of the light curve has 
been sampled but this is not necessarily the case with many 
of the examples in the SPCC. 

This suggests that the current features which aim to 
capture the shape of a light curve are neither robust enough 
in the presence of noisy inhomogeneous data nor do they 
capture enough information to act as significant discrimina- 
tors. Clearly further research in this area would be extremely 
beneflcial to the next generation of time domain surveys. 



5 DISCUSSION 

The results in the previous sections show that automated 
discovery systems of relationships can identify and char- 
acterize physically meaningful structure in data. The fact 
that known relationships in the Hertzsprung-Russell dia- 
gram, the fundamental plane of elliptical galaxies and the 
period-amplitude plane of RR Lyrae stars can be automat- 
ically recovered is very encouraging, particularly as more 
complex and accurate (with smaller errors) expressions are 
possible. The fitness metrics, however, provide a good bal- 
ance between accuracy and parsimony, ensuring high quality 
general hypotheses. It should also be noted that the discov- 
ery process consists not only of identifying the best func- 
tional expressions but also the most relevant subset of vari- 
ables. There are, of course, other specific feature selection 
algorithms, such as those mentioned in section 4, that could 
be have been applied to the data sets prior to the application 
of our methods as a preprocessing step. 

Perhaps one of the more surprising applications of these 
systems is as part of efficient binary classifiers, particularly 
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Table 5. The features selected by the various feature selection methods discussed in the text. For MIC, SBR and mRMR, the lists are 
in descending order of feature significance (indicated by *) . The features to which the variables correspond are described in Table IAT] 

Method RR Lyrae/W UMa CV / Blazar SNIa/CCSNc 

MIC* X24, 2:261 2:34, a;i4,a;28, 2:37, xi9,X3g 2:17, X22, 2:21 

Eureqa x\o,X24, xi,xi7 Xi3,xi5,xi8 

Decision tree xio, 2:14, 2:22, 2:21, a;35i 2:62 2;i, 2:2, 2:6, xg, 2:15, 2:19, 2:22 2:5, xio, X12, X13, 2:14 

SBR* 2:9, X52, X43, X20, 2:32 2;i8, 2;i6, 2;i5, 2;22, 2:20 2:22, 2:ig, xie, 2:15 , 2:9 

mRMR* X9, X20, 2:11, X19, 2;i3 X2, 2:17, 2:15 , 2:12 , 2:7 2:1, X2, X22, 2:15, xg 



Table 6. The relative fractions of shared features between those identified by Eureqa or decision trees and those that the three ordered 
algorithms have provided. 



Method 


RR Lyrae/W UMa 


CV / Blazar 


SNIa/CCSNe 


(MIC/SBR/mRMR) 


(8/5/5) 


(3/5/5) 


(0/5/5) 


Eureqa 


50%/0%/0% 


50%/0%/50% 


0%/67%/33% 


Decision tree 


16%/0%/0% 


14%/28%/14% 


0%/0%/0% 



as it has been said that we should not expect a lightly pa- 
rameterized form for mapping between feature space and 
class space (Richards 2011, private communication). To get 
some idea of how competitive thi s approach is, we can com- 
pare directly with the results of iDonalek et al.] (|2013l ') who 
have applied C4.5 decision trees using the Gini diversity in- 
dex as the splitting criterion to the same data sets as in 
section 3. Table [7] gives the relative performances of the two 
approaches in terms of purity - the fraction of true classifica- 
tions recovered out of all objects assigned to that class - and 
efficiency - the fraction of true classifications recovered out 
of all objects actually belonging to that class. For example 
from Table g] 464 true RR Lyrae are recovered, 471 (464 -|- 
7) objects are assigned a class of RR Lyrae, and there are 
482 (464 + 18) RR Lyrae in the data set - this gives a purity 
of 464 / 471 (98%) and an efliciency of 464 / 482 (96%). It 
can be seen that for four of the classes, the Eureqa-based 
approach performs as well as the decision tree one, partic- 
ularly in terms of efficiency (completeness). For additional 
comparison, the best results reported in SPC C were 96% ef- 
ficienc y and 79% purity for classifying SNe la (|Kessler et al.l 
l|2010l )). However, as expected, it does not perform so well 
with the two minority class populations. 

Imbalanced data sets, such as the CV/blazar and 
SNIa/CCSN examples, may reflect natural class distribu- 
tions - one type of object is just more common than the 
other ~ or may be the result of parameter/feature space 
sampling - observations are probing regions preferentially 
occupied by one class, even if the overall population sizes are 
similar. We chose to use as much data as possible in both the 
CV/blazar and SNIa/CCSN cases, which probably involves 
a mixture of both these effects. With such data sets, minor- 
ity class examples are classified incorrectly much more often 
than majority class examples (|Weiss fc ProvostI (|2003l )). as 
we found in section 3. Determining what the correct distri- 
bution is for a learning algorithm in this co ntext is an ac - 
tive area of research in machine learning (see lChawlal (|201Cll ) 
for an overview). Some practitioners believe that the nat- 
urally occurring marginal class distribution should be used 
so that new examples will be classified using a model built 



Table 7. The overall success rates for the Eureqa-based classifiers 
and the decision trees of Donalek et al. (2012) 



Data set 


Eureqa 


Decision tree 




Purity 


Efficiency 


Purity 


Efficiency 


RR Lyrae 


98% 


96% 


95% 


95% 


W UMa 


97% 


99% 


96% 


96% 


CV 


89% 


91% 


92% 


92% 


Blazar 


68% 


63% 


87% 


83% 


SN la 


76% 


93% 


90% 


96% 


CC SN 


74% 


41% 


92% 


80% 



from the same underlying distribution. Others feel that the 
training set should contain an increased percentage of mi- 
nority class examples or the induc ed classifier will n o t clas - 
sify minority class examples well. IWeiss fc ProvostI (j2003h 
show that the choice of training distribution can depend on 
the performance measure used with the natural distribution 
for predictive accuracy (confusion matrices) and a balanced 
distrib ution for ROC curves respectively. Although, IChawlal 
{201(f) argues that predictive accuracy may be an inappro- 
priate performance measure for imbalanced data sets. Our 
results are certainly consistent with all these findings. 

Finally, it is worth considering the limitations of auto- 
mated discovery systems. Eureqa assumes that relationships 
must be expressible as invariant (conserved) quantities in 
a partial differential metric space. However, this would not 
necessarily be true for systems that might be exhibiting frac- 
tal behaviour, such as scale dependency in the correlation 
properties of the large-scale distribution of galaxies (e.g., 
IJovce et all (|2005l )). or ch aotic or stochastic activity, such 
as in accretion discs (e.g., iKarak. Dutta fc Mukhopadhvavl 
(|2010l )). It is computationally expensive and more than lin- 
early so as the size of the search space is increased with the 
number of building blocks employed in search formulae. It 
can also suffer from the general limitations of evolutionary 
algorithms, requiring time to move out of local minima and 
the nature of the fitness landscape being unclear so it is dif- 
ficult to determine how well the algorithm might be doing. 
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Although there is no guarantee that it will find a good solu- 
tion nor that this will be the optimum, the results we have 
shown demonstrate that it is a useful technique to consider. 

MIC and its associated statistics have fewer assump- 
tions in just looking for and broadly characterizing bivari- 
ate relationships through their maximal mutual information 
- note they are not directly related to mutual information 
as they perform well in situations whe re other direc t mutu al 
information-based measures do not (jReshef et all lj201lf )). 
Ideally the MIC algorithm would optimize over all possible 
grid partitions of a data set but the computational expense 
is avoided with a dynamic programming approach that ap- 
pears to approximate well in most cases. It is unclear, how- 
ever, how well these perform in the presence of outliers or 
how large data sets need to be for stable estimates. Prob- 
ably the biggest current limitation to MIC is its bivariate 
nature - generalizations to higher dimensions are necessary 
to search for multivariate relationships (these will not neces- 
sarily show in a two-dimensional projection) but this comes 
at the additional expense of both finding an optimal hyper- 
grid partitioning of the data set and also using multivariate 
mutual information which is a poorly understood concept. 



6 CONCLUSIONS 

In this paper we have demonstrated that automated discov- 
ery systems can uncover significant (non-trivial) relation- 
ships in high-dimensional complex data parameter spaces. 
As with any relationship found in data, whether by an au- 
tomated system or a human, these may or may not have 
a physical meaning or cause - correlation does not imply 
causation - and they may be due to some incidental prop- 
erties of a given data set. The interpretation and evaluation 
of their possible physical significance remains in the hands 
of a human scientist. 

Whilst the ones we have shown may not be the most 
scientifically exciting, being more for illustrative purposes 
than anything, we should bear in mind that astronomy is 
an already relationship-rich science. Many of these are ex- 
pected or predictable associations, given what we already 
understand about the nature of (astro)physical systems. In 
contrast, systems biology and similar sciences, wherein lie 
the origins of these automated discovery techniques, are 
relationship-poor and there is potentially more upfront im- 
pact to be had by applying them in that particular context. 
Astronomy perhaps stands to benefit more from them as dis- 
covery filters, tackling the curse of dimensionality of high- 
dimensional parameter spaces and reducing the number of 
relationships to be examined to only the most significant, 
than as ab initio discovery engines. 

Although these systems represent the cutting-edge of 
currently applicable tools, this is very much an initial entry 
point for their application to astronomy. Such tools will very 
likely become both more powerful and also more prevalent 
with time, given the data challenges all sciences are facing. 
Expanded abilities such as not just relying on brute-force 
searches of feature spaces but being able to incorporate do- 
main knowledge, both as additional features, e.g, distance 
to nearest galaxy for supernovae, or as rulesets, and make 
inferences leading to more interesting discoveries are active 
areas of research. 



The importance of such approaches as we are faced 
with ever more parameter/feature rich data sets cannot 
be underestimated. In particular, the possibilities of high- 
dimensional scientific relationships, particularly those that 
do not necessarily reveal themselves in lower dimension rep- 
resentations, can only really be investigated using auto- 
mated discovery techniques which are (relatively) uncon- 
strained in their exploration of parameter space. These tools 
promise to cherry pick the higher hanging fruit of LSST, 
SKA and future surveys. 
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APPENDIX A: CHARACTERIZING FEATURES 

A variety of statistically- and morphologically-based fea- 
tures have been used to character i ze light curves in t h e liter - 
ature, e.g., [Richards et~aLl (|201lf ). iDebosscher et all (|2007h . 
Table \K\\ summarizes the statistics that we have used in this 
analysis. 

This paper has been typeset from a T^j^X/ ffl^rjX file prepared 
by the author. 
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Table Al. This table describes the features used to characterize the light curves used in this analysis. 



Name 



Variables Description 



Amplitude xi 

Beyond 1 std X2 

Flux percentile ratio (60 - 40) X3 

Flux percentile ratio (67.5 - 32.5) X4 
Flux percentile ratio (75 - 25) 

Flux percentile ratio (82.5 - 17.5) x% 

Flux percentile ratio (90 - 10) X7 
Linear trend 

Maximum slope xg 

Median absolute deviation x\q 

Median buffer range percentage x\\ 

Pair slope trend x\2 

Percent amplitude X13 

Percent difference flux percentile X14 

QSO xi5 

Skew xig 

Small kurtosis X20 

Standard deviation X21 

Stetson J X22 

Stetson K X23 

Lomb-Scargle peaks X24 

Frequency parameters X34 — xq2 



■ xis, 



xyi 



Half the difference between the minimum and maximum magnitudes 

The percentage of points beyond one standard deviation from the weighted mean 

The ratio of flux percentiles: (60th - 40th) to (95th - 5th) 

The ratio of flux percentiles: (67.5th - 32.5th) to (95th - 5th) 

The ratio of flux percentiles: (75th - 25th) to (95th - 5th) 

The ratio of flux percentiles: (82.5th - 17.5th) to (95th - 5th) 

The ratio of flux percentiles: (90th - 10th) to (95th - 5th) 

The slope of a linear fit to the light curve 

The maximum absolute flux slope between two consecutive observations 

The median discrepancy of the fluxes from the median flux 

The percentage of fluxes within 10% of the amplitude from the median 

The percentage of the last 30 pairs of consecutive flux measurements that have a positive slope 
The largest percentage difference between cither the maximum or minimum flux and the median 
The ratio of the (95th - 5th) flux percentile to the median flux 

The chisq/qso and chisq/non-qso statistics and their significance levels from the quasar 

(non-)variability metric of^utler^^^loom (201^) 

The skew of the magnitudes 

The kurtosis of the magitudes 

The standard deviation of the magnitudes 

The Welch-Stetson J variability index with an exponential weighting scheme 
The Welch-Stetson K variability index 

The periods and false-peak detection probabilities of the top 5 peaks in the Lomb-Scargle 
periodogram of the light curve 

The frequency analysis statistics described in ^ebosscher_e^^l^ (2007): the slope of the linear 
trend, the 3 prime frequencies and their first four harmonics (amplitude and phase for each) 
and the ratio of the variances of the light curve after and before subtraction of a harmonic fit 
with the first frequency. 



